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Abstract Clustering is a concept used in a huge variety of applications. We review a 
conceptually very simple algorithm for hierarchical clustering called in the follow- 
ing the mutual information clustering (MIC) algorithm. It uses mutual information 
(Ml) as a similarity measure and exploits its grouping property: The Ml between 
three objects X,Y, and Z is equal to the sum of the Ml between X and Y, plus the 
Ml between Z and the combined object {XY). We use MIC both in the Shannon 
(probabilistic) version of information theory, where the "objects" are probability 
distributions represented by random samples, and in the Kolmogorov (algorithmic) 
version, where the "objects" are symbol sequences. We apply our method to the 
construction of phylogenetic trees from mitochondrial DNA sequences and we re- 
construct the fetal ECG from the output of independent components analysis (ICA) 
applied to the ECG of a pregnant woman. 



1 Introduction 

Classification or organizing of data is a crucial task in all scientific disciplines. It 
is one of the most fundamental mechanism of understanding and learning ||T9l . De- 
pending on the problem, classification can be exclusive or overlapping, supervised 
or unsupervised. In the following we will be interested only in exclusive unsuper- 
vised classification. This type of classification is usually called clustering or cluster 
analysis. 



Alexander Kraskov 

UCL Institute of Neurology, Queen Square, London, WCIN 3BG, UK, e-mail: 
lakraskovBion . ucl . ac . uk i 

Peter Grassberger 

Department of Physics & Astronomy and Institute for Biocomplexity & Informatics, Uni- 
versity of Calgaiy, 2500 University Drive NW, Calgary AB, Canada T2N 1N4, e-mail: 
|pgrassbe@ucalgary . ca| 



1 



2 



Alexander Kraskov and Peter Grassberger 



An instance of a clustering problem consists of a set of objects and a set of prop- 
erties (called characteristic vector) for each object. The goal of clustering is the 
separation of objects into groups using only the characteristic vectors. Indeed, in 
general only certain aspects of the characteristic vectors will be relevant, and ex- 
tracting these relevant features is one field where mutual information (MI) plays 
a major role ll36l . but we shall not deal with this here. Cluster analysis organizes 
data either as a single grouping of individuals into non-overlapping clusters or as a 
hierarchy of nested partitions. The former approach is called parti tional clustering 
(PC), the latter one is hierarchical clustering (HC). One of the main features of HC 
methods is the visual impact of the tree or dendrogram which enables one to see 
how objects are being merged into clusters. From any HC one can obtain a PC by 
restricting oneself to a "horizontal" cut through the dendrogram, while one cannot 
go in the other direction and obtain a full hierarchy from a single PC. Because of 
their wide spread of applications, there are a large variety of different clustering 
methods in use ||T9|. In the following we shall only deal with agglomerative hierar- 
chical clustering, where clusters are built by joining first the most obvious objects 
into pairs, and then continues to join build up larger and larger objects. Thus the 
tree is built by starting at the leaves, and is finished when the main branches are 
finally joined at the root. This is opposed to algorithms where one starts at the root 
and splits clusters up recursively. In either case, the tree obtained in this way can be 
refined later by restructuring it, e.g. using so-called quartet methods 1341 171. 

The crucial point of all clustering algorithms is the choice of a proximity mea- 
sure. This is obtained from the characteristic vectors and can be either an indicator 
for similarity (i.e. large for similar and small for dissimilar objects), or dissimilar- 
ity. In the latter case it is convenient (but not obligatory) if it satisfies the standard 
axioms of a metric (positivity, symmetry, and triangle inequality). A matrix of all 
pairwise proximities is called proximity matrix. Among agglomerative HC meth- 
ods one should distinguish between those where one uses the characteristic vectors 
only at the first level of the hierarchy and derives the proximities between clusters 
from the proximities of their constituents, and methods where the proximities are 
calculated each time from their characteristic vectors. The latter strategy (which is 
used also in the present paper) allows of course for more flexibility but might also 
be computationally more costly. There exist a large number of different strategies 
lfT9l[30l . and the choice of the optimal strategy depends on the characteristics of the 
similarities: for ultrametric distances, e.g., the "natural" method is UPGMA ISOll . 
while neighbor joining is the natural choice when the distance matrix is a metric 
satisfying the four-point condition dij + d^i < max{dii; + dji,dii +djk) 1301 . 

In the present chapter we shall use proximities resp. distances derived from mu- 
tual information fg)- In that case the distances neither form an ultrametric, nor do 
they satisfy the above four-point condition. Thus neither of the two most popular 
agglomerative clustering methods are favored. But we shall see that the distances 
have another special feature which suggests a different clustering strategy discussed 
first in Il2ni23]| . 

Quite generally, the "objects" to be clustered can be either single (finite) patterns 
(e.g. DNA sequences) or random variables, i.e. probability distributions. In the latter 
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case the data are usually supplied in form of a statistical sample, and one of the 
simplest and most widely used similarity measures is the linear (Pearson) correlation 
coefficient. But this is not sensitive to nonlinear dependencies which do not manifest 
themselves in the covariance and can thus miss important features. This is in contrast 
to mutual information (MI) which is also singled out by its information theoretic 
background ||9|. Indeed, MI is zero only if the two random variables are strictly 
independent. 

Another important feature of MI is that it has also an "algorithmic" cousin, de- 
fined within algorithmic (Kolmogorov) information theory ll26l which measures 
the similarity between individual objects. For a comparison between probabilistic 
and algorithmic information theories, see ifTTl . For a thorough discussion of dis- 
tance measures based on algorithmic MI and for their application to clustering, see 
El EH 16). 

Another feature of MI which is essential for the present application is its grouping 
property: The MI between three objects (distributions) X,Y, and Z is equal to the 
sum of the MI between X and Y, plus the MI between Z and the combined object 
(joint distribution) [XY), 

IiX,Y,Z)^I{X,Y)+I{{X,Y),Z). (1) 

Within Shannon information theory this is an exact theorem (see below), while it 
is true in the algorithmic version up to the usual logarithmic correction terms ll26l . 
Since X,Y, and Z can be themselves composite, Eq.([T]i can be used recursively for a 
cluster decomposition of MI. This motivates the main idea of our clustering method: 
instead of using e.g. centers of masses in order to treat clusters like individual objects 
in an approximative way, we treat them exactly like individual objects when using 
MI as proximity measure. 

More precisely, we propose the following scheme for clustering « objects with 
MIC: 

(1) Compute a proximity matrix based on pairwise mutual informations; assign n 
clusters such that each cluster contains exactly one object; 

(2) find the two closest clusters / and j; 

(3) create a new cluster (ij) by combining / and j; 

(4) delete the lines/columns with indices ; and j from the proximity matrix, and 
add one line/column containing the proximities between cluster (ij) and all other 
clusters. These new proximities are calculated by either treating {Xi,Xj) as a single 
random variable (Shannon version), or by concatenating Xi and Xj (algorithmic ver- 
sion); 

(5) if the number of clusters is still > 2, goto (2); else join the two clusters and stop. 
In the next section we shall review the pertinent properties of MI, both in the 

Shannon and in the algorithmic version. This is applied in Sec. [3] to construct a 
phylogenetic tree using mitochondrial DNA and in Sec.|4]to cluster the output chan- 
nels of an independent component analysis (ICA) of an electrocardiogram (ECG) 
of a pregnant woman, and to reconstruct from this the maternal and fetal ECGs. We 
finish with our conclusions in Sec. |5] 
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2 Mutual Information 
2.1 Shannon Theory 

Assume that one has two random variables X and Y. If they are discrete, we write 
Pi{X) = prob(X = X,), pi{Y) = prob(}' = x,), and pij = pmh{X = Xi,Y ~ yi) for 
the marginal and joint distribution. Otherwise (and if they have finite densities) we 
denote the densities by fix (x) , fly (y) , and iJ.{x,y). Entropies are defined for the dis- 
crete case as usual by //(X) = ~^iPi{X)logpi{X), H{Y) = -I,/5,(i')logp;(>'), 
and H{X,Y) = — Pij^ogpij. Conditional entropies are defined as H{X\Y) = 
H{X,Y) —H{Y) = — Li.;\/'i; logp,|^. The base of the logarithm determines the units 
in which information is measured. In particular, taking base two leads to informa- 
tion measured in bits. In the following, we always will use natural logarithms. The 
MI between X and Y is finally defined as 

I{X,Y) = H{X) +H{Y) -H{X,Y) 

It can be shown to be non-negative, and is zero only when X and Y are strictly 
independent. For n random variables Xi .Xo ■ ■ -X,,, the MI is defined as 

I{Xx,...,X„) = Y^H{Xk)-H{Xx,...,Xn). (3) 

k=\ 

This quantity is often referred to as (generalized) redundancy, in order to distinguish 
it from different "mutual informations" which are constructed analogously to higher 
order cumulants |9 j, but we shall not follow this usage. Eq.© can be checked easily, 

/(X,y,Z) = H{X)+H{Y)+H{Z) -H{X,Y,Z) 

f~^k PiWpAn Pu{xY)pk{z) 

= I{X,Y)+I{{X,Y),Z), (4) 

together with its generalization to arbitrary groupings. It means that MI can be de- 
composed into hierarchical levels. By iterating it, one can decompose I{Xi . . .X„) 
for any n > 2 and for any partitioning of the set {Xi . . .X„) into the Mis between 
elements within one cluster and Mis between clusters. 

For continuous variables one first introduces some binning ('coarse-graining'), 
and applies the above to the binned variables. If x is a vector with dimension m and 
each bin has Lebesgue measure A, then pi{X) « /Xx(x)zi™ with x chosen suitably in 
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//bi„(X)«//(X)-mlogZi (5) 
where the differential entropy is given by 

H{X) = - j dx llx (x) log pLx (x) . (6) 

Notice that i7bin(^) is a true (average) information and is thus non-negative, but 
H{X) is not an information and can be negative. Also, H{X) is not invariant under 
homeomorphisms X ^(x). 

Joint entropies, conditional entropies, and Ml are defined as above, with sums 
replaced by integrals. Like ^(X), joint and conditional entropies are neither positive 
(semi-)definite nor invariant. But MI, defined as 

I{XJ) = // dxdy ^xY{x,y) log -j^f^^ , (7) 

is non-negative and invariant under x ^ (x) and y ^ i//(y ) . It is (the limit of) a true 
information, 

/(X,y)= lim[//bin(^)+//bin(}')-//bin(^,}')]. (8) 

4^0 



2.2 Estimating mutual Shannon information 

In applications, one usually has the data available in form of a statistical sample. 
To estimate I{X,Y) one starts from bivariate measurements (x,,3',), / = l,...N 
which are assumed to be iid (independent identically distributed) reahzations. For 
discrete variables, estimating the probabilities pij, etc., is straightforward: is 
just approximated by the ratio nt/N, where is the number of outcomes X —Xi. This 
approximation gives rise both to a bias in the estimate of entropies, and to statistical 
fluctuations. The bias can be largely avoided by more sophisticated methods (see 
e.g. |[T6l ). but we shall not go into details. 

For continuous variables, the situation is worse. There exist numerous strate- 
gies to estimate I{X,Y). The most popular include discretization by partitioning the 
ranges of X and Y into finite intervals jlOJ , "soft" or "fuzzy" partitioning using B- 
splines ifTTI . and kernel density estimators ll28l . We shall use in the following the 
MI estimators based on A:-nearest neighbors statistics proposed in Ref. ll22l . and we 
refer to this paper for a comparison with alternative methods. 



Notice that we have here assumed that densities really exists. If not e.g. if X lives on a fractal 
set), then ;n is to be replaced by the Hausdorff dimension of the measure jx . 
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2.3 k-nearest neighbors estimators 

There exists an extensive literature on nearest neighbors based estimators for the 
simple Shannon entropy 

H{X)^- j dxii{x)\ogii{x), (9) 

dating back at least to lfT2l [38l . But it seems that these methods have never been 
used for estimating MI. In JMl [H El El El [221 |39l it is assumed that x is one- 
dimensional, so that the X, can be ordered by magnitude andx,+i — x, for °°. 
In the simplest case, the estimator based only on these distances is 

H{X) « L'logCv,+i -X,) - v/(l) + Y{N) ■ (10) 

Here, y/{x) is the digamma function, yf{x) = F(x)^'^fF(x)/iix. It satisfies the re- 
cursion \j/{x + 1) = i^(x) + 1 /x and = -C where C = 0.5772156 ... is the 
Euler-Mascheroni constant. For large x, \//(x) ?» logx — 1 /2x. Similar formulas exist 
which use Xj^i^ — x, instead of x,+ 1 — x,, for any integer k < N. 

Although Eg. ([Toll and its generalizations to > 1 seem to give the best estimators 
of H{X), they cannot be used for MI because it is not obvious how to generalize 
them to higher dimensions. Here we have to use a slightly different approach, due 

to Eo). 

Assume some metrics to be given on the spaces spanned hy X,Y and Z = {X,Y). 
In the following we shall use always the maximum norm in the joint space, i.e. 

||z-z'||=max{||x-x'||,||y-y||}, (11) 

independently of the norms used for | |x — x' 1 1 and | — y' 1 1 (they need not be the 
same, as these spaces could be completely different). We can then rank, for each 
pointz, = (x;,y,), its neighbors by distance c/ij- = | jz; — z^j |: <dijj < .... 

Similar rankings can be done in the subspaces X and Y. The basic idea of ||20l is to 
estimate H{X) from the average distance to the A:-nearest neighbor, averaged over 
all X,. Mutual information could be obtained by estimating in this way H{X), H{Y) 
and H{X,Y) separately and using 

I{X,Y)=H{X)+H{Y)-H{X,Y) . (12) 

But using the same k in both the marginal and joint spaces would mean that the typi- 
cal distances to the ^— th neighbors are much larger in the joint (Z) space than in the 
marginal spaces. This would mean that any errors made in the individual estimates 
would presumably not cancel, since they depend primarily on these distances. 

Therefore we proceed differently in ll22l . We first choose a value of k, which 
gives the number of neighbors in the joint space. From this we obtain for each point 
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Zi = {xi,yi) a length scale £,-, and then we count the number of neighbors within this 
distance for each of the marginal points Xj and y, . 

Indeed, for each k two different versions of this algorithm were given in 1221 . 
In the first, neighborhoods in the joint space are chosen as (hyper-)squares, so that 
the the length scales e, are the same in x and in y. In the second version, the size 
of the neighborhood is further optimized by taking them to be (hyper- )rectangles, 
so that e, X £i.y Also, the analogous estimators for the generalized redundancies 
I{Xi,X2, . . .X,n) were given there. Both variants give very similar results. For details 
seeRef. Il22l . 

Compared to other estimators, these estimators are of similar speed (they are 
faster than methods based on kernel density estimators, slower than the B-spline es- 
timator of 1 11 1, and of comparable speed to the sophisticated adaptive partitioning 
method of [10]. They are rather robust (i.e., they are insensitive to outliers). Their 
superiority becomes most evident in higher dimensions, where any method based on 
partitioning fails. Any estimator has statistical errors (due to sample-to-sample fluc- 
tuations) and some bias. Statistical errors decrease with k, while the bias increases 
in general with k. Thus it is advised to take large k (up to k/N ~ 0. 1) when the bias 
is not expected be a problem, and to use small k (k= 1, in the extreme), if a small 
bias is more important than small statistical sample-to-sample fluctuations. 




0.01 0.02 0.03 0.04 0.05 

1 /N 

Fig. 1 Averages of the estimates of /'^'(X,F) — Iexact{X ,Y) for Gaussians with unit variance and 
covariances r = 0.9,0.6,0.3, and 0.0 (from top to bottom), plotted against 1 /N. In all cases k= I. 
The number of realizations over which this is averaged is > 2 x 10^ for W <= 1000, and decreases 
to f« lO"* for N = 40, 000. Error bars are smaller than the sizes of the symbols. 



A systematic study of the performance of these estimators and comparison with 
previous algorithms is given in Ref. 1221 . Here we will discuss just one further fea- 
ture of the estimators proposed in l22ll : They seem to be strictly unbiased whenever 
the true mutual information is zero. This makes them particularly useful for a test 
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for independence. We have no analytic proof for this, but very good numerical ev- 
idence. As an example, we show in Fig. [T] results for Gaussian distributions. More 
precisely, we drew a large number (typically 10^ and more) of A^— tuples of vectors 
{x,y) from a bivariate Gaussian with fixed covariance r, and estimated I{X,Y) with 
k= Ihy means of the second variant /P' 0^,Y) of our estimator The averages over 
all tuples ofP\xj)-lGaussiX,Y) is plotted in Fig.fflagainst 1 /N. Here, 

/Gauss(X,n = -^log(l -'■')■ (13) 

is the exact MI for Gaussians with covariance r ifTOl . 

The most conspicuous feature seen in Fig. [l] apart from the fact that indeed 
I^^\x,Y) — Ig!uiss{X ,Y) for ^ oo, is that the systematic error is compatible 
with zero for r = 0, i.e. when the two Gaussians are uncorrected. We checked this 
with high statistics runs for many different values of k and (a priori one should ex- 
pect that systematic errors become large for very small A^), and for many more distri- 
butions (exponential, uniform, ...). In all cases we found that both variants l''^\x,Y) 
and /'^' {X,Y) become exact for independent variables. 



2.4 Algorithmic Information Theory 

In contrast to Shannon theory where the basic objects are random variables and 
entropies are average informations, algorithmic information theory deals with in- 
dividual symbol strings and with the actual information needed to specify them. 
To "specify" a sequence X means here to give the necessary input to a universal 
computer U, such that U prints X on its output and stops. The analogon to entropy, 
called here usually the complexity K{X) of X, is the minimal length of any input 
which leads to the output X, for fixed U. It depends on U, but it can be shown that 
this dependence is weak and can be neglected in the limit when K{X) is large 1261 l9l. 

Let us denote the concatenation of two strings X and Y as XY. Its complexity is 
K{XY). It is intuitively clear that K{XY) should be larger than K{X) but cannot be 
larger than the sum K{X) +K{Y). Even if X and Y are completely unrelated, so that 
one cannot learn anything about Y by knowing X, K{XY) is slightly smaller that 
K{X) + K{Y). The reason is simply that the information needed to reconstruct XY 
(which is measured by K{XY)) does not include the information about where X ends 
and Y starts (which is included of course in K{X) +K{Y)). The latter information 
increases logarithmically with the total length A^ of the sequence XY. It is one of 
the sources for ubiquitous terms of order log(A^) which become irrelevant in the 
limit N ^ °°, but make rigorous theorems in algorithmic information theory look 
unpleasant. 

Up to such terms, K{X) satisfies the following seemingly obvious but non-trivial 
properties JS): 



1. ldempotsncy:K{XX)^K{X) 
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2. Monotonicity: K{XY) > K{X) 

3. Symmetry: K{XY) = K{YX) 

4. Distributivity: K{XY) +K{Z) < K{XZ)+K{YZ) 

Finally, one expects that K{X\Y), defined as the minimal length of a program 
printing X when Y is furnished as auxiliary input, is related to K{XY) — K{Y). In- 
deed, one can show ll26l (again within correction terms which become irrelevant 
asymptotically) that 

0<K{X\Y)c^K{XY)-K{Y) <K{X). (14) 

Notice the close similarity with Shannon entropy. 

The algorithmic information in Y about X is finally defined as 

/aig(X,y) - KiX) -KiX\Y) 0^ K{X) +K{Y)-K{XY). (15) 

Within the same additive correction terms, one shows that it is symmetric, /ai<,(X,F) = 
/aig(F,X), and can thus serve as an analogon to mutual information. 

From Turing's halting theorem it follows that K{X) is in general not computable. 
But one can easily give upper bounds. Indeed, the length of any input which pro- 
duces X (e.g. by spelling it out verbatim) is an upper bound. Improved upper bounds 
are provided by any file compression algorithm such as gnuzip or UNIX "com- 
press". Good compression algorithms will give good approximations to K{X), and 
algorithms whose performance does not depend on the input file length (in partic- 
ular since they do not segment the file during compression) will be crucial for the 
following. As argued in ||6l, it is not necessary that the compression algorithm gives 
estimates of K{X) which are close to the true values, as long as it satisfies points 1-4 
above. Such a compression algorithm is called normal in [[6]. While the old UNIX 
"compress" algorithm is not normal (idempotency is badly violated), most modern 
compression algorithms (see fl] for an exhaustive overview) are close to normal. 

Before leaving this subsection, we should mention that K{X\Y) can also be es- 
timated in a completely different way, by aligning X and Y |[29l . If X and Y are 
sufficiently close so that a global alignment makes sense, one can form from such 
an alignment a translation string Ty^x which is such that Y and Ty^x together 
determine X uniquely. Then K(Ty^x) is an upper bound to K{X\Y), and can be es- 
timated by compressing Ty^x- In ||29ll this was applied among others to complete 
mitochondrial genomes of vertebrates. The estimates obtained with state of the art 
global alignment and text compression algorithms (MAVID IZTl and Ipaql U) were 
surprisingly close to those obtained by the compression-and-concatenation method 
with the gene compression algorithm XM |[40l . The latter seems at present by far 
the best algorithm for compressing DNA sequences. Estimates for K{X\Y) obtained 
with other algorithms such as gencompress ||2l were typically smaller by nearly an 
order of magnitude. 
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2.5 Mi-Based Distance Measures 



Mutual information itself is a similarity measure in the sense that small values imply 
large "distances" in a loose sense. But it would be useful to modify it such that the 
resulting quantity is a metric in the strict sense, i.e. satisfies the triangle inequality. 
Indeed, the first such metric is well known within Shannon theory ID: The quantity 

d{X,Y)=H{X\Y)+H{Y\X)^H{X,Y) -I{X,Y) (16) 

satisfies the triangle inequality, in addition to being non-negative and symmetric 
and to satisfying d{X,X) =0. The analogous statement in algorithmic information 
theory, with H{X,Y) and I{X,Y) replaced by K{XY) and h\o{X,Y), was proven in 



But d{X,Y) is not appropriate for our purposes. Since we want to compare the 
proximity between two single objects and that between two clusters containing 
maybe many objects, we would like the distance measure to be unbiased by the 
sizes of the clusters. As argued forcefully in Il24ll25l . this is not true for /aig(X,y), 
and for the same reasons it is not true for I{X,Y) or d{X,Y) either: A mutual infor- 
mation of thousand bits should be considered as large, if X and Y themselves are 
just thousand bits long, but it should be considered as very small, if X and Y would 
each be huge, say one billion bits. 

As shown in ll24l |251 within the algorithmic framework, one can form two dif- 
ferent distance measures from MI which define metrics and which are normalized. 
As shown in |2il (see also pT)). the proofs of ll24l |251 can be transposed almost 
verbatim to the Shannon case. In the following we quote only the latter 

Theorem 1. The quantity 

DiX,Y)^l-Ipn.^^ (17) 
^ ' ' H{X,Y) H{X,Y) ^ ' 

is a metric, with D{X,X) ~ and D{X,Y) < I for all pairs (X^Y). 

Theorem 2. The quantity 

n'(X = 1 ^i^'Y) 

^ max{H{X),H{Y)} 
^ meix{HiX\Y),H{Y\X)} 

max{H{X),H{Y)} ^ ' 

is also a metric, also with D'{X,X) ~ and D'{X,Y) < I for all pairs (X,Y). It is 
sharper than D, i.e. D'(X,Y) < D{X,Y). 

Apart from scaling correctly with the total information, in contrast to d{X,Y), the 
algorithmic analogs to D{X,Y) andD'{X,Y) are also universal 125] . Essentially this 
means that if X « F according to any non-trivial distance measure, then X kY also 
according to D, and even more so (by a factor up to 2) according to D'. In contrast 
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to the other properties of D and D' , this is not easy to carry over from algorithmic to 
Shannon theory. The proof in Ref. |[25| depends on X and Y being discrete, which is 
obviously not true for probability distributions. Based on the universality argument, 
it was argued in ||251 that D' should be superior to D, but the numerical studies 
shown in that reference did not show a clear difference between them. In addition, 
D is singled out by a further property: 

Theorem 3. Let X and Y be two strings, and let W be the concatenation W = XY. 
Then W is a weighted "mid point" in the sense that 

D{X,W)+D{W,Y)^D{X,Y), D{X ,W) : D{Y ,W) = H{Y\X) : H{X\Y). (19) 

Proof. We present the proof in its Shannon version. The algorithmic version is ba- 
sically the same, if one neglects the standard logarithmic correction terms. 

Since H{X\W) = 0, one has I{X,W) = H{X). Similarly, H{X,W) = H{X,Y). 
Thus 

zxx.w, .m.y) ^ 2 - ^w±«n ^ 1 - ^ ^ m.y), ,20, 

which proofs the first part. The second part is proven similarly by straightforward 
calculation. 

For D' one has only the inequalities D'{X,Y) <D'{X,W)+D'{W,Y) <D{X,Y). 

Theorem 3 provides a strong argument for using D in MIC, instead of D' . This 
does not mean that D is always preferable to D' . Indeed, we will see in the next sec- 
tion that MIC is not always the most natural clustering algorithm, but that depends 
very much on the application one has in mind. Anyhow, we found numerically that 
in all cases D gave at least comparable results as D' . 

A major difficulty appears in the Shannon framework, if we deal with continuous 
random variables. As we mentioned above. Shannon informations are only finite for 
coarse-grained variables, while they diverge if the resolution tends to zero. This 
means that dividing MI by the entropy as in the definitions of D and D' becomes 
problematic. One has essentially two alternative possibilities. The first is to actually 
introduce some coarse-graining, although it would not have been necessary for the 
definition of I{X,Y), and divide by the coarse-grained entropies. This introduces an 
arbitrariness, since the scale A is completely ad hoc, unless it can be fixed by some 
independent arguments. We have found no such arguments, and thus we propose 
the second alternative. There we take Zi — > 0. In this case H{X) ^ m^logA, with 
nix being the dimension of X. In this limit D and D' would tend to 1. But using 
similarity measures 

S{X,Y) = {\-D{XJ))\og{\/A), (21) 
S'{X,Y) = {\-D'{X,Y))\og{\/A) (22) 



instead of D and D' gives exactly the same results in MIC, and 
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S{X,Y) 



nix + niy ' 



S'{X,Y) 



msx{mx,niy\ 



I{X,Y) 



(23) 



Thus, when dealing with continuous variables, we divide the MI either by the sum or 
by the maximum of the dimensions. When starting with scalar variables and whenX 
is a cluster variable obtained by joining m elementary variables, then its dimension 
is just m, = m. 



2.6 Properties and Presentation of MIC Trees 

MIC gives rooted trees: The leaves are original sequences/variables X, . . . ,Z, inter- 
nal nodes correspond to subsets of the set of all leaves, and the root represents the 
total set of all leaves, i.e. the joint variable {X . . .Z). A bad choice of the metric 
and/or of the clustering algorithm will in general manifest itself in long "caterpillar- 
like" branches, while good choices will tend towards more balanced branchings. 

When presenting the tree, it is useful to put all leaves on the x-axis, and to use in- 
formation about the distances/similarities to control the height. We have essentially 
two natural choices, illustrated in Figs. [2] and |5] respectively. In Fig.|5j the height of 
a node is given by the mutual information between all leaves in the subtree below 
it. If the node is a leave, its height is zero. If it is an internal node (including the 
root) corresponding to a subset ,y of leaves, then its height is given by the mutual 
information between all members of 



Let us assume that ,y has the two daughters X and Y, i.e. ,9^ = {XY). X and Y 
themselves might be either leaves or also internal nodes. Then the grouping property 
implies that height {-5^) - height {X) = I{y) - I{X) is the MI between X and all the 
other sequences/variables in which are not contained in X. This is non-negative 
by the positivity of MI, and the same is true when X is exchanged with Y . Thus 
the trees drawn in this way are always well formed in the sense that a mother node 
is located above its daughters. Any violation of this rule must be due to imprecise 
estimation of Mi's. 

A drawback of this method of representing the tree is that very close pairs have 
long branches, while relatively distant pairs are joined by short branches. This is 
the opposite of what is usually done in phylogenetic trees, where the branch lengths 
are representative of the distance between a mother and its daughter This can be 
achieved by using the alternative representation employed in Fig. |2] There, the 
height of a mother node W which has two daughters X and Y is given by 



height{y)=I[y) 



method 1. 



(24) 



height{W)^D{X,Y) 



method2. 



(25) 



Although it gives rise to more intuitive trees, it also has one disadvantage: It is no 
longer guaranteed that the tree is well formed, but it may happen that height (W) < 
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height{X). To see this, consider a tree formed by three variables X,Y, and Z which 
are pairwise independent but globally dependent: I{X,Y) ~ I{X,Z) ~ I{Y,Z) = 
but I{X,Y,Z) >CB In this case, all pairwise distances are maximal, thus also the 
first pair to be joined has distance 1 . But the height of the root is less than 1 . In our 
numerical applications we indeed found occasionally such "glitches", but they were 
rare and were usually related to imprecise estimates of MI or to improper choices of 
the metric. 



3 Mitochondrial DNA and a Phylogenetic Tree for Mammals 

As a first application, we study the mitochondrial DNA of a group of 34 mammals 
(see Fig.|2|i. Exactly the same species [3 1 had previously been analyzed in ll24l[3l 
I2TI . This group includes non-eutherian^ , rodents and lagomorpho ferungulate^ . 
primate^ members of the African clade J, and other^ It had been chosen in ||24 
because of doubts about the relative closeness among these three groups ll5l [3Tl . 

Obviously, we are here dealing with the algorithmic version of information the- 
ory, and informations are estimated by lossless data compression. For constructing 
the proximity matrix between individual taxa, we proceed essentially a in Ref. Il24l . 
But in addition to using the special compression program GenCompress JSJ, we also 
tested several general purpose compression programs such as BWTzip, the UNIX 
tool bzip2, and Ipaql H], and durilca |T]. Finally, we also tested XM Bol (the ab- 
breviation stands for "expert model"), which is according to its authors the most 
efficient compressor for biological (DNA, proteins) sequences. Indeed, we found 
that XM was even better than expected. While the advantage over GenCompress 
and other compressors was a few per cent when applied to single sequences ID, the 
estimates for MI between not too closely related species were higher by up to an 
order of magnitude. This is possible mainly because the Mi's estimated by means 
of GenCompress and similar methods are reliably positive (unless the species are 



^ An example is provided by three binary random variables with pooo = Poii = Pioi = Piio = 
1/2 -I- e and Pool = Poio = Pioo =piii = I/2-e. 

' opossum (Didelphis virginiana), wallaroo {Macropus robustus), and platypus {Ornithorhyncus 
analinus) 

* rabbit {Oryctolagus cuniculus), guinea pig (Cavia porceUus), fat dormouse {Myoxus glis), rat 
(Rattus norvegicus), squirrel (Sciurus vulgaris), and mouse {Mus musculus) 
^ horse {Equus caballus), donkey (Equus asinus), Indian rhinoceros (Rhinoceros unicornis), white 
rhinoceros (Ceratotherium simum), harbor seal (Phoca vitulina), grey seal (Halichoerus grypus), 
cat {Felis catus), dog (Canis hipus familiaris), fin whale (Balaenoptera physahis), blue whale 
(Balenoplera musculus), cow (Bos taurus), sheep (Ovis aries), pig (Sus scrofa) and hippopotamus 
(Hippopotamus amphibius) 

^ human (Homo sapiens), common chimpanzee (Pan troglodytes), pigmy chimpanzee (Pan panis- 
cus), gorilla (Gorilla gorilla), orangutan (Pongo pygmaeus), gibbon (Hylobates lar), and baboon 
(Papio hamadryas) 

' African elephant (Loxodonta africana), aardvai'k (Orycteropus afer) 

^ Jamaican fruit bat (Artibeus jamaicensis), amiadillo (Dasypus novemcintus) 
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from different phyla) but extremely small. Thus even a small improvement on the 
compression rate can make a big change in MI. 

In Ref. ||24l . the proximity matrix derived from MI estimates was then used as the 
input to a standard HC algorithm (neighbor-joining and hypercleaning) to produce 
an evolutionary tree. It is here where our treatment deviates crucially. We used the 
MIC algorithm described in Sec.[T] with distance D{X,Y). The joining of two clus- 
ters (the third step in the MIC algorithm) is obtained by simply concatenating the 
DNA sequences. There is of course an arbitrariness in the order of concatenation se- 
quences: XY and YX give in general compressed sequences of different lengths. But 
we found this to have negligible effect on the evolutionary tree. The resulting evolu- 
tionary tree obtained with Gencompress is shown in Fig.|2] while the tree obtained 
with XM is shown in Fig. [3] 




Fig. 2 Phylogenetic tree for 34 mammals (31 eutherians plus 3 non-placenta mammals), with 
mutual informations estimated by means of GenCompress. In contrast to Fig. |6] the heights of 
nodes are here and in the following tree equal to the distances between the joining daughter clusters. 

Both trees are quite similar, and they are also similar to the most widely ac- 
cepted phylogeny found e.g. in (Sj- All primates are e.g. correctly clustered and 
the ferungulates are joined together There are however a number connections (in 
both trees) which obviously do not reflect the true evolutionary tree. , As shown in 
Fig.|2]the overall structure of this tree closely resembles the one shown in Ref. ||3TI . 
All primates are correctly clustered and also the relative order of the ferungulates 
is in accordance with Ref. ifSTl . On the other hand, there are a number of connec- 
tions which obviously do not reflect the true evolutionary tree, see for example the 
guinea pig with bat and elephant with platypus in Fig.|2] and the mixing of rodents 
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Fig. 3 Same as in Fig. [2] but with mutual informations estimated by means of XM. Notice that the 
X-axis covers here, in contrast to Fig. [2] the entire interval from to I. 

with African clade and armadillo in Fig.|3] But all these wrong associations are be- 
tween species which have a very large distance from each other and from any other 
species within this sample. All in all, the results shown in Figs. |2] and |2] are in sur- 
prisingly good agreement (showing that neither compression scheme has obvious 
faults) and capture surprisingly well the known relationships between mammals. 
Dividing MI by the total information is essential for this success. If we had used the 
non-normalized I^io{X,Y) itself, results obtained in ||24| would not change much, 
since all 34 DNA sequences have roughly the same length. But our MIC algorithm 
would be completely screwed up: After the first cluster formation, we have DNA 
sequences of very different lengths, and longer sequences tend also to have larger 
MI, even if they are not closely related. 

One recurrent theme in the discussion of mammalian phylogenetic trees is the 
placement of the rodents 113111321 . Are they closer to ferungulates, or closer to pri- 
mates? Our results are inconclusive. On the one hand, the average distances between 
all 14 rodents and all 104 ferungulates in the Genebank data (Feb. 2008), estimated 
with XM, is 0.860 - which is clearly smaller than the average distance 0.881 to all 
28 primates. On the other hand, the distances within the group of rodents are very 
large, suggesting that this group is either not monophylic, or that its mtDNA has 
evolved much faster than, say, that of ferungulates. Either possibility makes a state- 
ment about the classification of rodents with respect to ferungulates and primates 
based on these data very uncertain. 

A heuristic reasoning for the use of MIC for the reconstruction of an evolutionary 
tree might be given as follows: Suppose that a proximity matrix has been calculated 
for a set of DNA sequences and the smallest distance is found for the pair {X,Y). 
Ideally, one would remove the sequences X and Y, replace them by the sequence of 
the common ancestor (say Z) of the two species, update the proximity matrix to find 
the smallest entry in the reduced set of species, and so on. But the DNA sequence 
of the common ancestor is not available. One solution might be that one tries to 
reconstruct it by making some compromise between the sequences X and Y. Instead, 
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we essentially propose to concatenate the sequences X and Y. This will of course not 
lead to a plausible sequence of the common ancestor, but it will optimally represent 
the information about the common ancestor During the evolution since the time of 
the ancestor Z, some parts of its genome might have changed both in X and in Y . 
These parts are of little use in constructing any phylogenetic tree. Other parts might 
not have changed in either They are recognized anyhow by any sensible algorithm. 
Finally, some parts of its genome will have mutated significantly in X but not in Y, 
and vice versa. This information is essential to find the correct way through higher 
hierarchy levels of the evolutionary tree, and it is preserved in concatenating. 

In any case, this discussion shows that our clustering algorithm produces trees 
which is closer in spirit to phenetic trees than to phylogenetic trees proper. As we 
said, in the latter the internal nodes should represent actual extinct species, namely 
the last common ancestors. In our method, in contrast, an internal node does not 
represent a particular species but a higher order clade which is defined solely on 
the basis of information about presently living species. In the phylogenetic context, 
purely phenetic trees are at present much less popular than trees using evolution- 
ary information. This is not so in the following application, where no evolutionary 
aspect exists and the above discussion is irrelevant. 



4 Clustering of Minimally Dependent Components in an 
Electrocardiogram 

As our second application we choose a case where Shannon theory is the proper 
setting. We show in Fig. |4] an ECG recorded from the abdomen and thorax of a 
pregnant woman |4] (8 channels, sampling rate 500 Hz, 5 s total). It is already seen 
from this graph that there are at least two important components in this ECG: the 
heartbeat of the mother, with a frequency of « 3 beat/s, and the heartbeat of the fetus 
with roughly twice this frequency. Both are not synchronized. In addition there is 
noise from various sources (muscle activity, measurement noise, etc.). While it is 
easy to detect anomalies in the mother's ECG from such a recording, it would be 
difficult to detect them in the fetal ECG. 

As a first approximation we can assume that the total ECG is a linear super- 
position of several independent sources (mother, child, noisei, noise2,...). A stan- 
dard method to disentangle such superpositions is independent component analysis 
(ICA) ifTSl . In the simplest case one has n independent sources i,(r), i~ 1 ...n and 
n measured channels Xi{t) obtained by instantaneous superpositions with a time in- 
dependent non-singular matrix A, 

^KO = LV^W- (26) 

In this case the sources can be reconstructed by applying the inverse transformation 
W = A^' which is obtained by minimizing the (estimated) mutual informations 
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between the transformed components y, (f ) = L'j=i ^i;^; (0- some of the sources 
are Gaussian, this leads to ambiguities ifTSl . but it gives a unique solution if the 
sources have more structure. 

In reality things are not so simple. For instance, the sources might not be inde- 
pendent, the number of sources (including noise sources!) might be different from 
the number of channels, and the mixing might involve delays. For the present case 
this implies that the heartbeat of the mother is seen in several reconstructed compo- 
nents y,, and that the supposedly "independent" components are not independent at 
all. In particular, all components y,- which have large contributions from the mother 
form a cluster with large intra-cluster Mis and small inter-cluster Mis. The same is 
true for the fetal ECG, albeit less pronounced. It is thus our aim to 

1) optimally decompose the signals into least dependent components; 

2) cluster these components hierarchically such that the most dependent ones are 
grouped together; 

3) decide on an optimal level of the hierarchy, such that the clusters make most sense 
physiologically; 

4) project onto these clusters and apply the inverse transformations to obtain cleaned 
signals for the sources of interest. 




1 2 seconds 3 4 5 



Fig. 4 ECG of a pregnant woman. 

Technically we proceeded as follows ll33l : 

Since we expect different delays in the different channels, we first used Takens 
delay embedding 1351 with time delay 0.002 s and embedding dimension 3, result- 
ing in 24 channels. We then formed 24 linear combinations y, (f ) and determined the 
de-mixing coefficients Wij by minimizing the overall mutual information between 
them, using the MI estimator proposed in l22] . There, two classes of estimators were 
introduced, one with square and the other with rectangular neighborhoods. Within 
each class, one can use the number of neighbors, called k in the following, on which 
the estimate is based. Small values of k lead to a small bias but to large statistical 
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Fig. 5 Least dependent components of the ECG shown in Fig. |4l after increasing the number of 
channels by delay embedding. 

errors, while the opposite is true for large k. But even for very large k the bias is zero 
when the true MI is zero, and it is systematically such that absolute values of the 
MI are underestimated. Therefore this bias does not affect the determination of the 
optimal de-mixing matrix. But it depends on the dimension of the random variables, 
therefore large values of k are not suitable for the clustering. We thus proceeded as 
follows; We first used A: = 100 and square neighborhoods to obtain the least depen- 
dent components yi{t), and then used k = 3 with rectangular neighborhoods for the 
clustering. The resulting least dependent components are shown in Fig.|5] They are 
sorted such that the first components (1-5) are dominated by the maternal ECG, 
while the next three contain large contributions from the fetus. The rest contains 
mostly noise, although some seem to be still mixed. 

These results obtained by visual inspection are fully supported by the cluster 
analysis. The dendrogram is shown in Fig. |6] In constructing it we used S{X,Y) 
(Eq.(l23])) as similarity measure to find the correct topology. Again we would have 
obtained much worse results if we had not normalized it by dividing MI by nix + ■ 
In plotting the actual dendrogram, however, we used the MI of the cluster to deter- 
mine the height at which the two daughters join. The MI of the first five channels, 
e.g., is sa 1.43, while that of channels 6 to 8 is « 0.34. For any two clusters (tuples) 
X = . . .X„ and y = Fi . . . one has /(X, F) > I{X) + I{Y). This guarantees, if 
the MI is estimated correctly, that the tree is drawn properly. The two slight glitches 
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(when clusters (1-14) and (15-18) join, and when (21-22) is joined with 23) result 
from small errors in estimating MI. They do in no way effect our conclusions. 



1.5- 
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Fig. 6 Dendrogram for least dependent components. The height where the two branches of a clus- 
ter join corresponds to the MI of the cluster. 

In Fig.|6]one can clearly see two big clusters corresponding to the mother and to 
the child. There are also some small clusters which should be considered as noise. 
For reconstructing the mother and child contributions to Fig. ID we have to decide on 
one specific clustering from the entire hierarchy. We decided to make the cut such 
that mother and child are separated. The resulting clusters are indicated in Fig.|6]and 
were already anticipated in sorting the channels. Reconstructing the original ECG 
from the child components only, we obtain Fig.|7] 



1 2 seconds 3 4 5 

Fig. 7 Original ECG where all contributions except those of the child cluster have been removed. 
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5 Conclusions 

We have shown that MI can not only be used as a proximity measure in clustering, 
but that it also suggests a conceptually very simple and natural hierarchical cluster- 
ing algorithm. We do not claim that this algorithm, called mutual information clus- 
tering (MIC), is always superior to other algorithms. Indeed, MI is in general not 
easy to estimate. Obviously, when only crude estimates are possible, also MIC will 
not give optimal results. But as MI estimates are becoming better, also the results 
of MIC should improve. The present paper was partly triggered by the development 
of a new class of MI estimators for continuous random variables which have very 
small bias and also rather small variances 1221 . 

We have illustrated our method with two applications, one from genetics and one 
from cardiology. For neither application MIC might give the very best clustering, 
but it seems promising and indicative of the inherit simplicity of our method that 
one common method gives decent results in two very different applications. 

If better data become available, e.g. in the form of longer time sequences in the 
application to ECG or of more complete genomes (so that mutual information can 
be estimated more reliably), then the results of MIC should improve. It is not obvi- 
ous what to expect when one wants to include more data in order to estimate larger 
trees. On the one hand, more species within one taxonomic clade would describe 
this clade more precisely, so results should improve. On the other hand, as clus- 
ters become bigger and bigger, also the disparities of the sequences lengths which 
describe these clusters increase. It is not clear whether in this case a strict normaliza- 
tion of the distances as in E's. ( 117118b is still appropriate, and whether the available 
compression algorithms will still be able to catch the very long resulting correlations 
within the concatenated sequences. Experiments with phylogenetic trees of animals 
with up to 360 species (unpublished results) had mixed success. 

As we said in the introduction, after a construction of a first tree one can try 
to improve on it. One possibility is to change the topology of the tree, using e.g. 
quartet moves and accepting them based on some heuristic cost function. One such 
cost function could be the sum of all distances between linked nodes in the tree. 
Alternatively, one could try to keep the topology fixed and change the sequences 
representing the internal nodes, i.e. deviate from simple concatenation. We have not 
tried either 

There are two versions of information theory, algorithmic and probabilistic, and 
therefore there are also two variants of MI and of MIC. We discussed in detail one 
application of each, and showed that indeed common concepts were involved in 
both. In particular it was crucial to normalize MI properly, so that it is essentially 
the relative MI which is used as proximity measure. For conventional clustering al- 
gorithms using algorithmic MI as proximity measure this had already been stressed 
in 12411251 . but it is even more important in MIC, both in the algorithmic and in the 
probabilistic versions. 

In the probabilistic version, one studies the clustering of probability distribu- 
tions. But usually distributions are not provided as such, but are given implicitly 
by finite random samples drawn (more or less) independently from them. On the 
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Other hand, the full power of algorithmic information theory is only reached for 
infinitely long sequences, and in this limit any individual sequence defines a se- 
quence of probability measures on finite subsequences. Thus the strict distinction 
between the two theories is somewhat blurred in practice. Nevertheless, one should 
not confuse the similarity between two sequences (two English books, say) and that 
between their subsequence statistics. Two sequences are maximally different if they 
are completely random, but their statistics for short subsequences is then identical 
(all subsequences appear in both with equal probabilities). Thus one should always 
be aware of what similarities or independencies one is looking for. The fact that MI 
can be used in similar ways for all these problems is not trivial. 
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